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University of Pennsylvania School of Medicine 

Graphs and networks are common ways of depicting biological 
information. In biology, many different biological processes are rep- 
resented by graphs, such as regulatory networks, metabolic pathways 
and protein-protein interaction networks. This kind of a priori use of 
graphs is a useful supplement to the standard numerical data such as 
microarray gene expression data. In this paper we consider the prob- 
lem of regression analysis and variable selection when the covariates 
are linked on a graph. We study a graph-constrained regularization 
procedure and its theoretical properties for regression analysis to take 
into account the neighborhood information of the variables measured 
on a graph. This procedure involves a smoothness penalty on the 
coefficients that is defined as a quadratic form of the Laplacian ma- 
trix associated with the graph. We establish estimation and model 
selection consistency results and provide estimation bounds for both 
fixed and diverging numbers of parameters in regression models. We 
demonstrate by simulations and a real data set that the proposed 
procedure can lead to better variable selection and prediction than 
existing methods that ignore the graph information associated with 
the covariates. 

1. Introduction. There has been a growing interest in penalized least 
squares problems via L\ or other types of regularization, especially in high- 
dimensional settings. Important penalty functions that can lead to sparse 
variable selection in regression include Lasso [Tibshirani (1996)] and SCAD 
[Fan and Li (2001)]. In particular, Lasso has the crucial advantage of being 
a convex problem, which leads to efficient computational algorithms by co- 
ordinate descent [Efron et al. (2004); Friedman et al. (2007); Wu and Lange 
(2008)] and sparse solutions. Zou (2006) proposed a novel adaptive Lasso 
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procedure and presented results on model selection consistency and oracle 
properties of the parameter estimates. Zhao and Yu (2006) presented the 
irrepresentable condition for model selection consistency of Lasso. Zhang 
and Huang (2006) studied the sparsity and bias of the Lasso selection in 
high-dimensional linear regression. Fan and Li (2001) and Huang and Xie 
(2007) established the asymptotic oracle properties of the SCAD-penalized 
least squares estimators when the number of covariates is fixed or increases 
with the sample sizes. These novel penalized estimation methods are quite 
effective in selecting relevant variables and in predicting future outcomes, 
especially in high-dimensional settings. 

New estimation procedures have also been developed in recent years to 
account for certain structures of the explanatory variables. These include the 
group Lasso procedure [Yuan and Lin (2006)] when the explanatory variables 
are grouped or organized in a hierarchical manner, the elastic net (Enet) pro- 
cedure [Zou and Hastie (2005)] that deals with groups of highly correlated 
variables, and the fused Lasso [Tibshirani et al. (2005)] that imposes the L\ 
penalty on the absolute differences of the regression coefficients in order to 
account for some smoothness of the coefficients. Nardi and Rinaldo (2008) 
established the asymptotic properties of the group Lasso estimator for linear 
models. Jia and Yu (2008) provided conditions for model selection consis- 
tency of the elastic net when p 3> n. Zou and Zhang (2009) proposed an 
adaptive elastic net with a diverging number of parameters and established 
its oracle property. Among these procedures, the Enet regularization and the 
fused Lasso are particularly appropriate for the analysis of genomic data, 
where the former encourages a grouping effect and the latter often leads to 
smoothness of the coefficient profiles for ordered covariates. 

Motivated by a genomic application in order to account for network infor- 
mation in the analysis of genomic data, Li and Li (2008) proposed a network- 
constrained regularization procedure for fitting linear regression models and 
for variable selection, where the predictors in the regression model are ge- 
nomic data that are measured on the genetic networks, which we call the 
graph-structured covariates. In particular, we assume that the covariates in 
the regression model are values of the nodes on a graph, where a link be- 
tween two nodes may indicate a functional relationship between two genes 
in a genetic network or physical neighborhood between two voxels on brain 
images. Since many biological networks are constructed using data from 
high-throughput experiments, often there is a probability associated with 
an edge to indicate the certainty of a link. Such an edge probability can be 
used as a weight in a undirected graph, in which case we have a weighted 
graph. This graph-constrained regularization procedure is similar in spirit 
to the fused Lasso [Tibshirani et al. (2005)], both of which try to smooth 
the regression coefficients in certain ways. However, the fused Lasso does not 
utilize prior graph information. Second, instead of using the L2 norm on the 
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differences of the coefficients of the linked variables, the fused Lasso uses the 
L\ norm on the differences, which tends to lead to the same regression co- 
efficients for linked variables. In this paper we consider the general problem 
of regression analysis when the explanatory variables are nodes on a graph 
and present a cyclical coordinate descent algorithm [Friedman et al. (2007)] 
to implement the network-constrained regularization procedure of Li and Li 
(2008). This algorithm provides new insight on how neighboring variables 
affect the coefficient estimate of a node. We also extend the procedure of Li 
and Li (2008) to account for the possibility of different signs of the regression 
coefficients for neighboring variables. In addition, we provide theoretical re- 
sults of the estimates, including sign consistency and error bounds of the 
estimator and Li consistency. 

This paper is organized as follows. In Section 2 we describe the problem 
of regression analysis with covariates measured on graphs. We then present 
a graph-constrained estimation (Grace) procedure in order to account for 
the graph structures in Section 2.1 and an efficient coordinate descent algo- 
rithm to implement the proposed regularization methods in Section 2.3. We 
present the estimation and model selection consistency results in Section 3. 
We provide Monte Carlo simulation results in Section 4 and results from 
the application to the analysis of a data set on the gene expression of brain 
aging in Section 5. Finally, we give a brief discussion of the methods and 
results. 

2. Regression analysis for covariates measured on a graph. Consider a 
weighted graph G = (V, E, W), where V = {1, . . . ,p} is the set of vertices 
that correspond to the p predictors, E = {u ~ v } is the set of edges indicat- 
ing that the predictors u and v are linked on the graph and there is an edge 
between u and v, and W is the set of weights of the edges, where w(u, v) de- 
notes the weight of edge e = {u ~v). In genomic studies, biological networks 
are often represented as graphs, an edge between u and v on the graph can 
indicate some functional relationship between them and the weight can be 
used to measure the uncertainty of the edge between two vertices, for exam- 
ple, indicating the probability of having an edge between two variables when 
the graph is constructed from data. For each given sample, we assume that 
we have numerical measurements on each of the vertices and these measure- 
ments are treated as explanatory variables in a regression analysis frame- 
work. For the uth node, let Xi u be the numerical measurement of the uth 
vertex on the graph for the zth individual. Further, let x w — (^iu, • • • ,^rm) 
be the measured values at the uth vertex for n i.i.d. samples. Consider the 
problem of variable selection and estimation where we have design matrix 
X = (xi,x 2 , . . . ,x p ) G Jl nxp and response vector y = (yi,y 2 ,.. .,y n ) T G 7Z n , 
and they follow a linear model 




y = XP + e 
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where e = (ei, . . . , e n ) T ~ iV(0, cr 2 I n ) and /3 = . . . , /3 p ) T . Throughout this 
paper we assume that the predictors and the response are centered so that 

n n 1 n 

^2 Vi = °> X] Xi 3 = °' and ~ ^2 x % = 1 for j = 1, . . . ,p. 

i=l i=l i=l 

In this paper we consider that the design matrix X is a deterministic matrix 
in the fixed design settings. 

When p is large, we assume that model (2.1) is "sparse," that is, most 
of the true regression coefficients (3 are exactly zero. Without loss of gen- 
erality, we assume the first q elements of vector j3 are nonzeroes. Denote 
P{i) = (Pi,-- ■ ,Pq) T and /3( 2 ) = (P g +i, ■ • • ,/3 p ) T , then element- wise ^ / 
and (3(2) = 0. Now write X^) and X( 2 ) as the first q and last p — q columns 
of X, respectively, and let C = ^X T X, which can then be expressed in the 
following block- wise form: 

C = ( ^ n ^ 12 1 
\C2l C22 / 

The goal of this paper is to develop a regularization procedure for selecting 
the true relevant variables. Different from the existing approaches, we par- 
ticularly account for the fact that the explanatory variables are related on 
a graph. We make this more precise in the next section. 



2.1. Graph- constrained regularization and variable selection. In order to 
account for the fact that the p explanatory variables are measured on a 
graph, we first introduce the Laplacian matrix [Chung (1997)] associated 
with a graph. Let the degree of the vertex v be d v = ^2 U ^ V w(u, v) . We say u 
is an isolated vertex if d u = 0. Let w(u,u) = 0. Following Chung (1997), we 
define the Laplacian matrix L for graph G with the uv th element defined by 

{1 — w(u, u)/d u , if u = v and d u 7^ 0, 

—w(u,v) / y/d u d v , if u and v are adjacent, 
0, otherwise. 

It is easy to verify that this matrix is positive semi-definite with as the 
smallest eigenvalue and 2 as the largest eigenvalue when all the weights are 
1 [Chung (1997)]. To allow the matrix to change with n, we further express 
this matrix in block-wise form, 

l =(l 1 l 12 V 

\ lj 21 L22 / 

where Ln corresponds to the q nodes that are relevant to the response and 
L22 corresponds to the p — q nodes that are not relevant. 
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The Laplacian matrix has the following interpretations. For a given vector 
j3, the edge derivative of /3 along the edge e(u,v) at u is defined as 



The smoothness of vector f3 with respect to the graph structure can be 
expressed as 



This variation functional for vectors f3 penalizes vectors that differ too much 



over nodes that are linked. It contains a scaling by \fdu- One intuitive reason 

for such a scaling is to allow a small number of nodes with large d u to 
have more extreme values of f3 u while the usually much greater number of 
nodes with small d u should not ordinarily allow to have very large f3 u . This 
variation functional has been widely used in semi-supervised learning on 
graphs [Zhu (2005); Zhou et al. (2004)]. 

For many problems with covariates measured on a graph, we would expect 
that the neighboring variables are correlated and, therefore, the regression 
coefficients would show some smoothness. One way to account for such a 
dependence of the regression coefficients is to impose a Markov random field 
(MRF) prior to the collection of f3 vectors. The MRF decomposes the joint 
prior distribution of the /3 u 's into lower-dimensional distributions based on 
the graph-neighborhood structures. A common MRF model is the Gaussian 
MRF model that assumes that the joint distribution of (3 is given by 



which is an improper density. Based on this Gaussian MRF prior assump- 
tion, Li and Li (2008) introduced the following graph-constrained estimation 
of the regression coefficients, denoted by /3, 




and, therefore, the local variation of f3 at u can be measured by 







(2.3) 



/3 = argminQ(/3,Ai,A 2 ) 



where 



Q(J3, Ai, A 2 ) = ||y - X/3||| + Ai||0||i + A 2 /3 r L/3 
= (y-X/3) T (y-X/3) + A 1 ^|/3 ti 



u 
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where L is the Laplacian as defined in (2.2) and the tuning parameters Ai, A2 
control the amount of regularization for sparsity and smoothness. For the 
special case when A2 = 0, the estimate reduces to the Lasso, and when L is 
the identity matrix, the estimate reduces to the elastic net estimates. 



2.2. An adaptive graph- constrained regularization. The Grace procedure 
may not perform well when two variables that are linked on the graph have 
different signs in their regression coefficients, in which case the coefficients 
are not expected to be locally smooth. For example, for genetic networks, two 
genes might be negatively correlated with the phenotypes and are therefore 
expected to have different signs in their regression coefficients. In order to 
account for the sign differences, we can first perform a standard least square 
regression when p < n or the elastic net regression when p>n and denote 
the estimate as /3. We can then modify the above objective function as 



Q*(Ai, A 2 ,/3) = ||y - xp\\i + Ai||/% + A 2 /5 J L*/3, 

p 



X/3|li + Ai]T|/% 



r- 



2 



v-^/'sign(/3 u )/3 u sign(/3„)/V. 
^£(— ^ 7 =^) W (u,v), 



where 



{1 — w(u,u)/d u , if u = v and d u ^ 0, 

— sign(/3 u ) sign((3 v ) w (u, v) / \/d u d v , if u and v are adjacent, 
0, otherwise. 

Note that the L* matrix is still positive semi-definite. We call the /3 defined 
by 

(2.4) /3 = argminQ*03,Ai,A 2 ) 

P 

the adaptive Grace (aGrace). 



2.3. A coordinate descent algorithm. Friedman et al. (2007) presented 
a coordinate descent algorithm for solving the Lasso and the Enet regu- 
larization. In this section we develop a similar algorithm for the proposed 
graph-constrained regularization. We only present the detailed algorithm for 
the optimization problem defined by equation (2.3). Similar algorithms can 



VARIABLE SELECTION FOR GRAPH COVARIATES 



7 



be developed by the aGrace defined by (2.4). If we let A = (Ai + 2A2) /2n 
and a = Ai/(Ai + 2A2), the Grace can be written as 



(2.5) 
where 



P(X,a) = argmin{#(/3) := ^-\\y - X/3||| + XP a (fi) 
a 2n 



P a {fi) := (1 - a) Vl/3 + a^h = (1 - a)\ £ 



Pu_ 



Pv 



+ aJ2\p u \ 



u=l 



is the graph-constrained penalty function. 

Given a covariate x M , suppose we have estimated f3 v for v 7^ u, and we 
want to partially minimize the objective function with respect to (3 U . We 
can rewrite the objective function in (2.5) as 

2 1 / a \ 2 



R (fl = ^ E - E x ™Pv - + a(i - E 

1=1 ^ v=£u v~u 



+ Aa|A»| + A(l-a)- 

w ~u 



Pu Pv 



Pa 



Pv_ 

\fd~v 



Xa ^2 \p w \ 



We would like to compute the gradient at p u , which only exists when p u ^ 0. 
We first consider the case that the covariate u is connected to some other 
nodes (variables) on the network. If P u > 0, due to the standardization of 
the covariates, we can differentiate the objective function R(P) with respect 
to p u as 



dR 

Wu 



i E (v* - E + A a - «) E 7^ 



+ Ac* + [1 + A(l - a)]p u . 

Similarly, we can get the corresponding expression when p u <0. Following 
the calculus by Donoho and Johnstone (1994) and Friedman et al. (2007), 
we obtain the coordinate-wise update form for p u as 



s (( 1 / n ) Ei=i x iu(Vi -Vj ) + K l ~ a ) J2v~ u (Pv/Vd u d v ), Xa) 

l + A(l-a) 



(2.6) p u 
where: 

• y\ = Yl v ^u x ivPv is the partial residual for fitting p u , that is, the fit- 
ted value excluding the contribution from Xi u . Since the covariates are 
standardized, - Yli=i x iu(yi — Y!v^u x ivPv) is the simple least-squares co- 



efficient while fitting the partial residual to Xi 



1 n. 
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• S(z, r y) is the soft-thresholding operator with value 

{z — 7, if z > and 7 < \z\, 
z + ^j, if z < and 7 < jzj, 
0, otherwise. 

When covariate u is not connected to other nodes on the network, that 
is, when it has no neighbors, the corresponding coordinate-wise updating 
formula becomes the Lasso updating formula, that is 

(2.7) p u ^st-^UVi ~ A") • 

^ n i=i ' 

Comparing the two updated forms of (2.6) and (2.7), an intuitive expla- 
nation can be drawn to help to understand the effect of the graph-constraint 
penalty on the coefficients. For an isolated predictor, the graph penalty is 
vanished and, thus, we only apply a soft-thresholding for the Lasso penalty, 
while for a connected predictor, form (2.6) takes into account the graph- 
constraint to the penalty by adding the scaled summation of the coefficients 
of the neighboring covariates to the simple least-squares coefficient and ap- 
plying a proportional shrinkage for the graph penalty. 

Given a, we can compute the solution path for a decreasing sequence of 
values for A, starting from the smallest value A max for which there is no 
covariate selected, that is, (3 = 0. Similar to Friedman et al. (2007), we can 
set A max = m&xi\(xi,y)\/na, A m j n = eA max and construct a sequence of K 
values of A decreasing from A max to A m i n on the log scale. Typical values are 
e = 0.001 and K = 100. Cross-validation (CV) can be used to select the two 
tuning parameters a and A. 

3. Error bound and model selection consistency for fixed and diverging p. 

In this section we provide some theoretical results on the proposed Grace 
procedure, including the error bounds, L2 consistency of Grace and the 
model selection consistency for both fixed and diverging p when p tends to 
infinity with the sample size n. Our theoretical development follows that of 
Zhao and Yu (2006), Jia and Yu (2008) and Zou and Zhang (2009) on sign 
consistency of Lasso and adaptive elastic net estimates. In our theoretical 
analysis, we assume the following regularity conditions throughout: 

(Al) We use A m ; n (M) and A max (M) to denote the minimum and maxi- 
mum eigenvalues of a positive definite matrix M, respectively. We further 
assume that C = ^X T X is positive definite and 

b < A min (C) < A max (C) < B, 

where b and B are two positive constants that do not depend on n. 
(A2) i maxi<i< n Y%=i x ij ~> °> as n ~> 00 • 
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These two conditions assume that the predictor matrix has a reasonably 
good behavior and were also assumed in Zhao and Yu (2006) and in Zou 
and Zhang (2009). Condition (Al) is also the condition (F) in Fan and Peng 
(2004) and condition (A2) ensures that the rows of the matrix X behave like 
a sample from a probability distribution in TV [Portnoy (1984)]. These two 
conditions hold naturally if one assumes that Xi are i.i.d. with finite second 
moments. 



3.1. Error bound and L2- consistency of Grace. We first provide the fol- 
lowing nonasymptotic risk bound for the Grace of the regression coefficients 
defined by (2.1) for any p and n: 

Theorem 3.1. Given the data (y, X), define the Grace as 

/3(Ax, A 2 ) = argmin{||y - X/3||| + X^h + A 2 /3 T L/3}, 

P 

for nonnegative tuning parameters \\ and A 2 . Then under the regularity 
condition (Al), we have 

4A|AL x (L)||/3( 1) ||i + 4pn J Ba 2 + 2A^ 



(3.1) Em*xM-m< 



n 



2 ALn(C + (A 2 /n)L) 



The proof of this theorem is given in Li and Li (2010). Note that this 
result is not asymptotic and holds for any p and q <p. From this theorem, 
under the regularity assumption (Al) and the following further assumptions 
on p and the tuning parameters Ai, A 2 : 

(A3) li mjwoo £ = 0, 
(A4) lim^ ^ = 0, 

(A5) lim^ £ = and li m?woo MMk = 0> 



we have 



||/3(A l5 A 2 )-/?||! AO, 



which implies that the Grace of /? is L 2 consistent. This result implies that 
the Grace procedure chooses the important variables with high probability 
and that falsely chosen variables by Grace have very small coefficients. The 
L 2 consistency result suggests that we may use some hard-thresholding pro- 
cedure to further eliminate the variables with very small Grace coefficients. 
Alternatively, an interesting randomized selection procedure proposed by 
Bickel, Ritov and Tsybakov (2008) can be used to further eliminate the 
variables with small estimated Grace coefficients. Note that under the clas- 
sical setting where p, q and are all fixed as n— > 00, the assumptions 
(A3)-(A5) hold when Aj/n -> 00, i = 1, 2. 
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3.2. Model selection consistency when p is fixed. We next establish the 
results on model selection consistency for the standard case where p and q 
are fixed when n — > oo. Following Zhao and Yu (2006), we define the Grace 
of (3 to be sign consistent if there exists Ai and A2 as functions of n such 
that 

lim Pr(sign(/3(Ai, A 2 )) = sign(/3)) = 1. 

71— >00 

To establish the sign consistency of the Grace, we first provide the following 
graph-constrained irrepresentable condition (GC-IC): there exists 7] > and 
Ai > 0, A2 > 0, such that 

-1 



(3.2) 



C21 H — -L21 

n 



<l-r/, 



Cn H — -Ln 

n 



2A2 

sign(/3 ( i)) + —LxiPh) 

~ ^ L 21P(1) 



where 1 is a vector of Is with length p — q and the inequality holds element- 
wise. Further, we assume that C — > Co, where Co is positive definite. The 
GC-IC is a consequence of the Karush-Kuhn-Tucker (KKT) conditions for 
the following constrained optimization problem that corresponds to the pe- 
nalized optimization problem of equation (2.5): 

/3(A, a) = argmin/^ ||y - X/3f 2 : P a (/3) < A 
p [2n 



Theorem 3.2. For fixed p,q and /?, if C — > Co, where Cq is positive 
definite and condition (A. 2) holds, the graph- constrained estimate is sign 
consistent if and only if GC-IC (3.2) holds for Ai, A2 that satisfy Ai/i/n — > 00 
and Xi/n — > 0, for i = 1, 2. 

This theorem is a special case of Theorem 3.3 and its proof is similar 
to that of Zhao and Yu (2006) for Lasso estimates. We therefore omit its 
proof in this paper. Note that the required conditions on the sparsity tuning 
parameter Ai are the same as those for the Lasso [Zhao and Yu (2006)], 
for example, Ai = i/nlogn is a suitable choice. This theorem indicates that 
under some restrictive conditions of the design matrix and the Laplacian 
matrix of the network, the sign consistency property holds for the graph- 
constrained regularization. To gain further insight into GC-IC, consider the 
special cases when A2 is preselected and fixed and when Ai goes to infinity, 
the GC-IC reverses back to the irrepresentable condition for the Lasso given 
in Zhao and Yu (2006) and the graph-constrained penalty function Ai||/3||i + 
\ 2 (3 T L(3 = Ai(||/3||i + ^/3 T L/3) is reduced to the Lasso penalty. 



VARIABLE SELECTION FOR GRAPH COVARIATES 



11 



3.3. Model selection consistency when p diverges. We now consider the 
model selection consistency of the graph-constrained regularization proce- 
dure under the settings when the number of covariates p also goes to infinity 
as n — > oo, in which case, the assumptions and the regularity conditions for 
Theorems 3.1 and 3.2 become inappropriate as C does not converge and f3 
may change as n grows. The following theorem shows that for the general 
scalings when p, q and n all go to infinity, under some additional conditions 
between p, q and n, GC-IC also guarantees that the Grace is sign consistent 
in selecting the true model. 

Theorem 3.3. Suppose each column o/X is normalized to the L2-norm 
of n and GC-IC (3.2) holds. Define p := min |(C U + ^Ln^tCii^i))! 
and C m i n = A m i n (Cn), where A m i n (-) denotes the minimal eigenvalue. Let 
W max = max UjV {w(u,v)}. If M and A2 are chosen such that: 

(a) //L 12 = 0, 

Af 

nlog(p-q) 

or if L12 / 0, 

A| 

y 00 

log(p - q)(n + AlW^ ax /(nC min )) 

( b ) V H^/Wn + ^IIC0 X1 + ^Ln)" 1 sign(/3 (1) )|U} -+ 0, 
then the Grace /3(Ai,A2) is sign consistent as n— >oo. 

A proof analogous to Jia and Yu (2008) can be found in Li and Li (2010). 
Theorem 3.3 gives a general sign consistency result for the Grace for general 
scalings of p, q and n. If C m i n > a for some a > and p < po for some 
po > 0, it is easy to check that the conditions logq/n — > 00 and Xi^/q/n — > 
guarantee that condition (b) in Theorem 3.3 holds. In the settings when 
p and q are fixed, if Cn converges to a nonnegative definite matrix Cno, 
p converges to a nonnegative number. In addition, it is easy to verify that 
the conditions in Theorem 3.2 guarantee that the conditions (a) and (b) in 
Theorem 3.3 hold. 

4. Monte Carlo simulations. We conducted Monte Carlo simulations to 
evaluate the proposed Grace and aGrace procedures and to compare the 
performance of this procedure with Lasso and Enet in terms of prediction 
errors and identification of relevant variables. We simulated the graph to 
mimic gene regulation modules. We used genes and variables interchangeably 
in this section. We assumed that the graph consisted of 200 unconnected 
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regulatory modules with 200 transcription factors (TFs) and each regulated 
10 different genes for a total of 2200 variables. Among these modules and 
genes, we further assumed that four TFs and their 10 regulated genes (for 
a total of 44 variables) were associated with the response based on the 
following model: 

44 

(4.1) Y = Y J PuX u + e. 

u=i 

We considered two different models. For the first model, we assumed that 
the coefficients in model (4.1) were specified as 




10 



and the e was random mean-zero normal error with variance a 2 = ^2 u Pu/^- 
For each TF, the X value was simulated from a iV(0, 1) distribution, and 
conditional on the value of the TF, we simulated the expression levels of 
the genes that they regulated from a conditional normal distribution with 
correlations of 0.2, 0.5 and 0.9, respectively. We therefore had a total of 
2200 variables and 44 of them were relevant. For the second model, we 
considered the case when the regulated genes had different signs in regression 
coefficients, where the regression coefficients in model (4.1) have the same 
absolute values as in Model 1, but for each simulated module, three out of 
the 10 genes regulated by the TF had different signs from the other 7 genes. 
The X values were generated in the same way as in previous simulations. 
In this model, genes that are regulated by the same transcription factor are 
assumed to have different regression coefficients. 

For each model, our simulated data consisted of a training set, an inde- 
pendent validation set and an independent test set with a sample size of 
200 for all three data sets. Models were fitted on training data only, and the 
validation data were used to select the tuning parameters. We computed the 
prediction mean-squared errors on the test data set. For each model, we re- 
peated the simulations 100 times. Table 1 shows the prediction mean-square 
errors for several different procedures. For Model 1 when the neighboring 
genes have the same signs in regression coefficients, we observed that the 
Grace gave the smallest prediction errors for all four models with different 
correlations among the predictors. Both Grace and Enet performed better 
than Lasso in prediction. When the correlation is very high, the prediction 
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Table 1 

Comparison of prediction mean-square errors (SE) using Grace, aGrace, Enet and Lasso 

for three different correlation structures of 0.2, 0.5 and 0.9 between the transcription 
factors and their regulated genes for each of the two models considered. The results are 

based on 100 replications 



Method 




Model 1 (Cor 


) 




Model 2 (Cor) 




0.2 


0.5 


0.9 


0.2 


0.5 


0.9 


Grace 


24.93 


23.22 


22.56 


53.08 


42.07 


28.20 




(2.97) 


(2.41) 


(2.20) 


(6.45) 


(5.03) 


(2.87) 


aGrace 


24.93 


23.22 


22.56 


27.70 


26.23 


25.55 




(2.97) 


(2.41) 


(2.20) 


(3.66) 


(3.03) 


(2.76) 


Enet 


51.33 


37.37 


25.82 


56.18 


45.65 


27.33 




(6.65) 


(4.69) 


(2.67) 


(7.22) 


(5.81) 


(2.72) 


Lasso 


53.41 


40.30 


27.82 


57.62 


47.65 


29.23 




(6.68) 


(4.94) 


(2.98) 


(7.01) 


(5.53) 


(2.78) 


errors 


of these procedures were 


comparable, 


however, 


Grace still \ 


$ave the 



smallest prediction error among the procedures compared. When the signs 
of the regression coefficients were the same, aGrace was reduced to Grace 
and gave the same prediction results. For Model 2 when the neighboring 
variables have different signs of coefficients, aGrace adjusting for the signs 
of the regression coefficients gave the smallest prediction errors, further in- 
dicating the importance of adjusting for the signs in the regularization. In 
general, Grace gave similar prediction results as the Enet, except when the 
correlation between the transcription factors and their regulated genes was 
very high, in which case Enet resulted in a slightly smaller prediction error. 

To compare the performance on variable selection, Figure 1 shows the re- 
ceiver operating characteristic (ROC) curves of several different procedures 
in selecting the relevant variables for the models with correlation of 0.2 and 
0.9 between the TF and their regulated genes. For Grace, aGrace and Enet, 
the ROC curves were obtained as a function of the sparsity parameter Ai 
with tuning parameter A2 selected based on 5-fold cross-validation among 
the values of 0.1, 1, 10, 100 and 1000. For Model 1 when the neighbor- 
ing genes have the same signs in regression coefficients [Figure 1 plots (a) 
and (b)], Grace gave much larger areas under the ROC cruves than Enet 
and Lasso, indicating better performance in variable selection for Grace. 
In addition, five-fold cross-validation always chose the largest A2 = 1000 for 
Grace and A2 = 0.1 for Enet in all 100 replications. For Model 2 when the 
neighboring variables have different signs of coefficients, aGrace adjusting 
for the signs of the regression coefficients performed better than the other 
three procedures compared, resulting in larger areas under the curves, and 
Grace still performed better than Lasso and Enet on variable selections in 
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1 -specificity 1 -specificity 



Fig. 1. Comparison of ROCs for Grace, aGrace, Enet and Lasso for Model 1 [plots (a) 
and (b)/ and Model 2 [plots (c) and (d)/ and for correlations of 0.2 [plots (a) and (c)/ and 
0.9 [plots (b) and (d)/. The ROCs were calculated as a function of the sparsity parameter 
Ai . For Grace, aGrace and Enet, the tuning parameter A2 was selected based on 5-fold CV. 

both low and high correlation scenarios. When the correlation among the 
relevant variables is low, the 5-fold CV selected A2 = 1000 for aGrace and 
A2 = 0.1 for Enet in all 100 replications and selected A2 = 100 for Grace in 
most of the replications. When the correlation among the relevant variables 
was high, the 5-fold CV selected A2 = 1000 for aGrace and A2 = 100 for 
Grace in most of the 100 replications and selected A2 = 0.1 for Enet in all 
the replications. 
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5. Application to network-based analysis of gene expression data. To 

demonstrate the proposed method, we consider the problem of identifying 
age-dependent molecular modules based on the gene expression data mea- 
sured in human brains of individuals of different ages published in Lu et al. 
(2004). In this study the gene expression levels in the postmortem human 
frontal cortex were measured using the Affymetrix arrays for 30 individuals 
ranging from 26 to 106 years of age. To identify the aging-regulated genes, 
Lu et al. (2004) performed simple linear regression analysis for each gene 
with age as a covariate. We analyzed this data set by combining the KEGG 
regulatory network information with the gene expression data [Kanehisa 
and Goto (2002)]. In particular, we limited our analysis to the genes that 
can be mapped to the KEGG regulatory work and focused on the problem 
of identifying the subnetworks of the KEGG regulatory network that are 
associated with human brain aging. By merging the gene expression data 
with the KEGG regulatory pathways, the final KEGG network includes 1305 
genes and 5288 edges. 

We treated the logarithm of the individual age as the response variable 
and the expression levels (after log 10 transformation) of 1305 genes on the 
KEGG network as the explanatory variables in our analysis. Table 2 shows 
the results of several different procedures where the tuning parameters were 
selected by five-fold cross-validations. Overall, we observed that the Lasso 
selected the fewest number of genes with relatively large cross-validation 
errors and Grace and Enet selected roughly the same number of genes with 
similar CV errors. However, the adaptive Grace resulted in more identified 
genes with similar CV errors than the other two procedures. Figure 2 shows 
the subnetworks identified by four different estimation procedures. As a com- 
parison, we also included the genes selected by Lasso, although it did not 
select any linked pairs of genes on the KEGG network. It is interesting to 
note that as we impose more constraints on the regression coefficients, more 
linked genes are identified. Both Enet and Grace identified some common 

Table 2 

Results of analysis of brain aging gene expression data by four different procedures, 



including the number of genes 


selected (No. 


genes), the number of linked KEGG edges 


(No. e 


dges), the five-fold cross- 


validation error (CV error) and the 


values of the tuning 




parameters selected(\2 


for Grace, a 


Grace and Enet and si 






No. genes 


No. edges 


CV error 


Tuning parameters 


Grace 


45 


9 


0.079 


A 2 = 0.1, si =4.72 


aGrace 


73 


28 


0.077 


A 2 = 0.01, si =6.97 


Enet 


41 


10 


0.077 


A 2 = 1.0, S! = 5.64 


Lasso 


IS 





0.098 


si = 5.65 
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subnetworks that were associated with brain aging. These included fibrob- 
last growth factors (FGF) and its receptors. It has been demonstrated that 
FGFs are associated with many developmental processes including neural 
induction [Bottcher and Niehrs (2005)] and are involved in multiple func- 
tions including cell proliferation, differentiation, survival and aging [Yeoh 
and de Haan (2007)]. It is also interesting to observe that mitogen-activated 
protein kinase (MAPK) (MAPK1 and MAPK9) and the specific MAPK ki- 
nase (MAP2K) were also identified by Enet and Grace. The MAPKs play 
important roles in induction of apoptosis [Hayesmoore et al. (2009)]. Other 
interesting genes include RAS protein-specific guanine nucleotide-releasing 
factor 1 (RASGRF1), the functionality of which is highly significant in vari- 
ous contexts of the central nervous system. In the hippocampus, RASGRF2 
has been shown to interact with the NR2A subunits of NMDARs, trigger- 
ing Ras-ERK activation and induction of long-term potentiation, a form of 
neuronal plasticity that contributes to memory storage in the brain [Tian 
et al. (2004); Lu et al. (2004)]. Finally, the insulin receptor gene (INSR) is 
also identified. INSR binds insulin (INS) and regulates energy metabolism. 
Evidence from model organisms, including results from fruit flies [Tatar et 
al. (2001)] and roundworms [Kimura et al. (1997)], relates INSR homologues 
to aging, most likely as part of the GH1/IGF1 axis. These results indicated 
that our method can indeed recover some biologically interesting molecular 
modules or KEGG subnetworks that are related to brain aging in humans. 

It is important to point out that the adaptive Grace identified several 
small sets of connected genes that were missed by Enet or the standard 
Grace. One of the subnetworks included EPHRIN and Eph receptor, both 
of which were found to be related to neural development and entohino- 
hippocampal axon targeting [Flanagan and Vaderhaeghen (1998); Stein et 
al. (1999)]. Another subnetwork was part of the Jak-State signaling, which is 
important in both mature and aging brains [De-Frajaa et al. (2000)]. Aging 
was also found to be associated with increased human T cell CC chemokine 
receptor gene expression [Yung et al. (2003)]. Other interesting subnetworks 
included PVRL3-PVRL1 that are associated with cell adhesion. 

6. Discussion. We have introduced and studied the theoretical proper- 
ties of a graph-constrained regularized estimation procedure for linear re- 
gressions in order to incorporate information coded in graphs. Such a reg- 
ularization procedure can also be regarded as a penalized least squared es- 
timation where the penalty is defined as a combination of the L\ and 
penalties on degree-scaled differences of coefficients between variables linked 
on the graphs. This penalty function induces both sparsity and smoothness 
with respect to the graph structure of the regression coefficients. Simulation 
studies indicated that when the coefficients are similar for variables that are 
neighbors on the graph, the proposed procedure has better prediction and 
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Fig. 2. Subnetworks identified by (a) Elastic net (Enet), (b) Grace and (c) aGrace for 
brain aging gene expression data (only those genes that are linked on the KEGG network 
are plotted). 
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identification performance than other commonly used regularization proce- 
dures such as Lasso and elastic net regressions. Such improvement results 
from effectively utilizing the neighboring information in estimating the re- 
gression coefficients. If the smoothness assumption on the coefficients does 
not hold, we expect that the cross-validation selects a very small value of A2 
and, therefore, the proposed procedure would perform similarly as the Lasso. 
In analysis of the brain aging gene expression data, different from Lasso, 
the new procedure tends to identify sets of linked genes on the networks, 
which often leads to better biological interpretation of the genes identified. 
Although the methods presented are largely motivated by applications in 
genomic data, they can be applied to other settings when the covariates are 
nodes on general graphs, such as in image analysis. 

Although the methods presented in this paper were developed mainly for 
linear models, similar methods can be developed for the generalized linear 
models and the censored survival data regression models, where we can use 
the negative of the logarithm of the likelihood or partial likelihood as the loss 
function. Similar to the techniques presented in Friedman et al. (2007) and 
Wu and Lange (2008), we can use the coordinate descent procedure together 
with the iterative reweighted least square to obtain the solution path. Such 
models have great applications in genomic data analysis in identifying the 
genes or subnetworks that are associated with binary or censored survival 
data outcomes. Other extensions include replacing the L\ part of the Grace 
penalty with other sparse penalty functions such as SCAD or bridge penalty 
[Huang et al. (2008)]. Important future research also includes how to handle 
covariates that are linked on directed graphs. Finally, to incorporate the fact 
that the linked nodes might be negatively correlated and the corresponding 
regression coefficients may have different signs, we introduced an adaptive 
sign-adjusted graph-constrained regularization procedure and showed that 
such a procedure can perform better than the original graph-constrained 
regularization. The theoretical property of such an adaptive procedure is 
unknown and is an area for future research. 

Acknowledgments. We thank the two reviewers for their comments that 
have greatly improved the presentation of this paper. 

SUPPLEMENTARY MATERIAL 

Proofs of Theorem 3.1 and Theorem 3.3 (DOI: 10.1214/10-AOAS332SUPP; 
.pdf). We present the details of the proofs of Theorem 3.1 and Theorem 3.3 
in the Supplemental Materials. 
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